Documentation Fundamentals
Markdown, README, and Codebooks
Why Documentation Matters
Good documentation is essential for reproducible research. Without it, even you won’t understand your own work six months later. Documentation serves three audiences: your future self, your collaborators, and the broader research community.
Key Definitions
Before diving in, let’s clarify terms that are often used interchangeably but mean different things:
| Term | What it is | Typical format |
|---|---|---|
| README | Project overview and setup instructions | .md, .txt, .pdf |
| Codebook | Detailed variable-level documentation | .pdf, .xlsx |
| Data dictionary | Technical specification of variables (often synonymous with codebook) | .xlsx, .csv, .txt |
| Data lineage | The path data takes from source to final form | Diagram or narrative |
| Metadata | Data about data (when collected, by whom, how) | Various |
About Markdown
Markdown is a lightweight markup language that’s become the standard for documentation in data science. It’s readable as plain text but renders nicely in browsers and editors.
Tools for working with Markdown
- Quarto — the successor to R Markdown, works with R, Python, Julia
- Online Markdown editor — for quick testing
- Pandoc — converts between formats (md → docx, pdf, html)
- Dillinger — another online editor with live preview
Quick Markdown reference
# Heading 1
## Heading 2
**bold** and *italic*
- bullet point
1. numbered list
[link text](url)
`inline code`
What is a Good README?
A README is the front door to your project. Someone should be able to understand what your project does, how to use it, and where to find things—all from reading the README.
Key Ingredients
A complete README for a research project should include:
- Overview — What is this project? What question does it answer?
- Data sources — Where does the data come from? Any access restrictions?
- File structure — What’s in each folder? Which scripts run in what order?
- Requirements — Software, packages, and versions needed
- Instructions — How to run the analysis from start to finish
- License — Terms for reuse (MIT, CC-BY, etc.)
- Contact — Who to ask questions
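The ingredient list above can be turned into a quick automated check. The sketch below is one possible approach, not a standard tool: the function name `missing_sections` and the exact section titles are assumptions made for illustration, and it only looks for Markdown headings whose text matches a required title.

```python
# Hypothetical section titles, taken from the ingredient list above.
REQUIRED_SECTIONS = [
    "Overview", "Data sources", "File structure",
    "Requirements", "Instructions", "License", "Contact",
]


def heading_titles(readme_text):
    """Collect lower-cased titles of lines that look like Markdown headings."""
    titles = set()
    for line in readme_text.splitlines():
        stripped = line.strip()
        # Simplification: any line starting with '#' counts as a heading,
        # including '#' comments inside code blocks.
        if stripped.startswith("#"):
            titles.add(stripped.lstrip("#").strip().lower())
    return titles


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return the required section titles that have no matching heading."""
    found = heading_titles(readme_text)
    return [name for name in required if name.lower() not in found]


# Made-up README used only to show the check in action.
example = """# Hotel prices
## Overview
What drives hotel prices?
## Data sources
Scraped price listings.
## License
MIT
"""

print(missing_sections(example))
# ['File structure', 'Requirements', 'Instructions', 'Contact']
```

A check like this only confirms that a section exists. Whether the section is actually useful still needs a human reader.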
Data Description Checklist
For each dataset in your project, document:
- Name and file format (csv, parquet, xlsx)
- Number of observations and variables
- Unit of observation (person, firm-year, country-month)
- Time coverage and geographic scope
- Key variables with brief descriptions
- Missing data: how much and why
- Data lineage: source → processing → final structure
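Several items on this checklist can be computed rather than typed by hand. As a minimal sketch, assuming a small CSV file where an empty cell means a missing value, the hypothetical helper `describe_csv` below counts observations and variables and reports the share of missing values per variable. The sample data is invented for illustration.

```python
import csv
import io

# Made-up firm-year data; the empty cell in row 2 stands for a missing value.
SAMPLE_CSV = """firm_id,year,sales
1,2019,100
1,2020,
2,2019,250
2,2020,300
"""


def describe_csv(text):
    """Count observations and variables, and the share of empty cells per variable."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    variables = reader.fieldnames or []
    n_obs = len(rows)
    share_missing = {}
    for name in variables:
        # Treat empty strings, whitespace, and absent fields as missing.
        n_empty = sum(1 for row in rows if not (row[name] or "").strip())
        share_missing[name] = n_empty / n_obs if n_obs else 0.0
    return {"n_obs": n_obs, "n_vars": len(variables), "share_missing": share_missing}


summary = describe_csv(SAMPLE_CSV)
print(summary["n_obs"], "observations,", summary["n_vars"], "variables")
print(summary["share_missing"])
```

The numbers cover "how much" is missing. The "why" (refusals, skip patterns, merge failures) still has to come from someone who knows how the data was collected.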
Examples of Good READMEs
Reproduction packages
- Békés-Kézdi (2021) Hotels dataset — clean, minimal, effective
- Koren-Pető (2021) Business disruptions from social distancing | PDF version — comprehensive research package
Templates and guides
- Make a README — interactive guide with examples
- Social Science Data Editors Template — journal-standard template
- AEA Data Editor guidance — requirements for top economics journals
What is a Codebook (Variable Dictionary)?
A codebook provides detailed, variable-level documentation. While the README gives the big picture, the codebook tells you exactly what Q47_recoded means.
What to Include for Each Variable
| Element | Example |
|---|---|
| Variable name | income_hh |
| Label | Household monthly income |
| Type | Numeric (continuous) |
| Unit/metric | EUR, monthly |
| Valid range | 0–999999 |
| Coding for categories | 1=Low, 2=Medium, 3=High |
| Missing values | -99 = refused, NA = not asked |
| Share missing | 4.2% |
| Notes | Top-coded at 99th percentile |
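Once these elements are written down, they can also be used to check the data. The following is a sketch under stated assumptions: the `CodebookEntry` class and `check_variable` function are hypothetical names, the entry mirrors the `income_hh` example in the table, and the observed values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """A few codebook elements for one numeric variable (illustrative only)."""
    name: str
    label: str
    valid_min: float
    valid_max: float
    missing_codes: set = field(default_factory=set)


def check_variable(values, entry):
    """Return the share of missing values and the values outside the valid range."""
    n_missing = 0
    out_of_range = []
    for value in values:
        if value is None or value in entry.missing_codes:
            # None plays the role of NA; coded missings such as -99 count too.
            n_missing += 1
        elif not (entry.valid_min <= value <= entry.valid_max):
            out_of_range.append(value)
    share_missing = n_missing / len(values) if values else 0.0
    return share_missing, out_of_range


income_hh = CodebookEntry("income_hh", "Household monthly income", 0, 999999, {-99})
observed = [1200, -99, 3400, None, 1_500_000]  # made-up values

share, bad = check_variable(observed, income_hh)
print(f"{share:.1%} missing, out of range: {bad}")
```

Comparing the computed share with the "Share missing" row is a cheap way to notice when the codebook and the data have drifted apart.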
Examples of Good Codebooks
- Békés-Kézdi (2021) Bisnode dataset variables — clean spreadsheet format
- Reif (2022) Illinois Wellness codebook — plain text, version controlled
- On earnings data – used in this course
Tips for AI-Assisted Documentation
LLMs can significantly speed up documentation, but their output requires careful verification.
What AI does well
- Summarizing long codebooks
- Generating first drafts of variable descriptions
- Suggesting what’s missing from your documentation
- Converting between formats (e.g., codebook PDF → markdown table)
What requires human oversight
- Verifying variable definitions match actual data
- Checking that coded values (1, 2, 3…) match the stated meaning
- Ensuring coverage statistics are accurate
- Confirming data lineage is correct
A practical workflow
- Start with a first look to get a feel for the material: skim the documents, check the index, and open the data.
- Upload your codebook/data to the LLM
- Ask for a structured summary
- Verify 3-5 variables manually against the source
- Iterate: ask the AI to fix the errors you find
- Final human review before publishing
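Part of the manual verification step can be supported by a small script. The sketch below assumes a categorical variable and an AI-drafted mapping of codes to labels. The name `compare_codes` and all values are hypothetical.

```python
def compare_codes(observed_values, documented_codes):
    """Compare the codes found in the data with the codes a description lists."""
    observed = set(observed_values)
    documented = set(documented_codes)
    return {
        "undocumented": sorted(observed - documented),  # in the data, not described
        "unused": sorted(documented - observed),        # described, never observed
    }


# Hypothetical AI-drafted description and made-up column values.
drafted = {1: "Low", 2: "Medium", 3: "High"}
column = [1, 2, 2, 4, 1, 2]

report = compare_codes(column, drafted)
print(report)
# {'undocumented': [4], 'unused': [3]}
```

A set comparison like this can flag codes that the draft missed or invented. It cannot confirm that 1 really means "Low", so the labels themselves still have to be checked against the original source.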
Further Reading
- Crystal Lewis, Data Management in Large-Scale Education Research — practical guide to data structure
- TIER Protocol — comprehensive reproducibility framework